Abstract

This paper reviews the functional aspects of statistical learning theory. The main point under
consideration is the nature of the hypothesis set when no prior information is available other than the data.
Within this framework we first discuss the hypothesis set: it is a vector space, it is a set of pointwise
defined functions, and the evaluation functional on this set is a continuous mapping. Based on these
principles, an original theory is developed that generalizes the notion of reproducing kernel Hilbert space
to non-hilbertian sets. It is then shown that the hypothesis set of any learning machine has to be a
generalized reproducing set. Therefore, thanks to a general "representer theorem", the solution of the
learning problem is still a linear combination of a kernel. Furthermore, a way to design these kernels is
given. To illustrate this framework, some examples of such reproducing sets and kernels are given.

1 Some questions regarding machine learning 

Kernels, and in particular Mercer or reproducing kernels, play a crucial role in statistical learning theory
and functional estimation. But very little is known about the associated hypothesis set, the underlying
functional space where learning machines look for the solution. How to choose it? How to build it? What
is its relationship with regularization? The machine learning community has been interested in tackling
the problem the other way round: for a given learning task, and therefore for a given hypothesis set, is there
a learning machine capable of learning it? The answer to such a question makes it possible to distinguish
between learnable and non-learnable problems. The remaining question is: is there a learning machine
capable of learning any learnable set?

We have known since [13] that learning is closely related to approximation theory, to generalized spline
theory, to regularization and, beyond, to the notion of reproducing kernel Hilbert space (r.k.h.s). This
framework is based on the minimization of the empirical cost plus a stabilizer (i.e. a norm in some Hilbert
space). Under these conditions, the solution to the learning task is a linear combination of some
positive kernel whose shape depends on the nature of the stabilizer. This solution is characterized by strong
and useful properties such as universal consistency.

But within this framework there remains a gap between theory and the solutions implemented by
practitioners. For instance, in r.k.h.s, kernels are positive. Some practitioners use the hyperbolic tangent
kernel $\tanh(w^\top x + w_0)$ although it is not a positive kernel: but it works. Another example is given by
practitioners using a non-hilbertian framework. Advocates of sparsity use absolute values such as
$\int |f| \, d\mu$ or $\sum_j |\alpha_j|$: these are $L^1$ norms. They are not hilbertian. Others escape the hilbertian
approximation orthodoxy by introducing prior knowledge (i.e. a stabilizer) through information type criteria
that are not norms.
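
The non-positivity of the hyperbolic tangent kernel can be checked on a very small case. The sketch below is only an illustration: the parameter values $w = 1$ and $w_0 = -1$ and the helper names are assumptions made for the example, not values taken from the text. With a negative offset, a single point at the origin already gives a negative quadratic form, which a positive kernel never does.

```python
import math


def tanh_kernel(x, y, w=1.0, w0=-1.0):
    """Sigmoid-type kernel tanh(w * <x, y> + w0); w and w0 are illustrative values."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return math.tanh(w * dot + w0)


def quadratic_form(kernel, points, alphas):
    """Return sum_i sum_j alpha_i * alpha_j * K(x_i, x_j)."""
    return sum(
        ai * aj * kernel(xi, xj)
        for ai, xi in zip(alphas, points)
        for aj, xj in zip(alphas, points)
    )


# One point at the origin: the quadratic form reduces to K(0, 0) = tanh(w0) < 0.
value = quadratic_form(tanh_kernel, [(0.0, 0.0)], [1.0])
print(value)
```

This particular counterexample relies on the offset being negative; for other parameter choices a different set of points would be needed.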

This paper aims at revealing some underlying hypotheses of the learning task, extending the reproducing
kernel Hilbert space framework. To do so we begin by reviewing some learning principles. We will stress
that the hilbertian nature of the hypothesis set is not necessary while the reproducing property is. This leads
us to define a non-hilbertian framework for reproducing kernels allowing non-positive kernels, non-hilbertian
norms and other kinds of stabilizers.

The paper is organized as follows. The first point is to establish the three basic principles of learning.
Based on these principles and before entering the non-hilbertian framework, it appears necessary to recall
some basic elements of the theory of reproducing kernel Hilbert spaces and how to build them from a
non-reproducing Hilbert space. Then the construction of non-hilbertian reproducing spaces is presented by
replacing the dot (or inner) product by a more general duality map. This implies distinguishing between
two different sets put in duality, one for hypotheses and the other one for measuring. In the hilbertian
framework these two sets are merged in a single Hilbert space.

But before going into technical details we think it advisable to review the use of r.k.h.s in the learning 
machine community. 

2 r.k.h.s perspective 

2.1 Positive kernels 

The interest of r.k.h.s arises from its associated kernel: a r.k.h.s is a set of functions entirely
defined by a kernel function. A kernel may be characterized as a function from $X \times X$ to $\mathbb{R}$ (usually
$X \subset \mathbb{R}^d$). Mercer [11] first established some remarkable properties of a particular class of kernels:
positive kernels defining an integral operator. These kernels have to belong to some functional space
(typically $L^2(X \times X)$, the set of square integrable functions on $X \times X$) so that the associated integral
operator is compact. The positivity of kernel $K$ is defined as follows:

$K(x, y)$ positive $\iff \forall f \in L^2, \ \langle \langle K, f \rangle_{L^2}, f \rangle_{L^2} \geq 0$

where $\langle ., . \rangle_{L^2}$ denotes the dot product in $L^2$. Then, because it is compact, the kernel operator admits a
countable spectrum and thus the kernel can be decomposed. Based on that, the work by Aronszajn [2] can
be presented as follows. Instead of defining the kernel operator from $L^2$ to $L^2$, Aronszajn focuses on the
r.k.h.s $\mathcal{H}$ endowed with its dot product $\langle ., . \rangle_{\mathcal{H}}$. In this framework the kernel has to be a pointwise defined
function. The positivity of kernel $K$ is then defined as follows:

$K(x, y)$ positive $\iff \forall g \in \mathcal{H}, \ \langle \langle K, g \rangle_{\mathcal{H}}, g \rangle_{\mathcal{H}} \geq 0$    (1)

Aronszajn first establishes a bijection between kernels and r.k.h.s. Then L. Schwartz [16] shows that this
is a particular case of a more general situation. The kernel does not have to be a genuine function. He
generalizes the notion of positive kernels to weakly continuous linear applications from the dual set $E^*$ of a
vector space $E$ to $E$ itself. To share interesting properties the kernel has to be positive in the following sense:

$K$ positive $\iff \forall h \in E^*, \ \langle K(h), h \rangle_{E, E^*} \geq 0$

where $\langle ., . \rangle_{E, E^*}$ denotes the duality product between $E$ and its dual set $E^*$. The positivity is no longer
defined in terms of a scalar product. But there is still a bijection between positive Schwartz kernels and
Hilbert spaces.

Of course this is only a short part of the story. For a detailed review of r.k.h.s and a complete literature
survey see [1, 14]. Moreover some authors consider non-positive kernels. A generalization to Banach sets
has been introduced [4] within the framework of approximation theory. Non-positive kernels have
also been introduced in Krein spaces as the difference between two positive ones ([1] and [16], section 12).

2.2 r.k.h.s and learning in the literature 

The first contribution of r.k.h.s to statistical learning theory is the regression spline algorithm. For an
overview of this method see Wahba's book [20]. In this book two important hypotheses regarding the
application of the r.k.h.s theory to statistics are stressed. These are the nature of pointwise defined functions
and the continuity of the evaluation functional (these notions are formally given in section 3.5, definition 3.1
and equation (3)). An important and general result in this framework is the so-called representer theorem [9].
This theorem states that the solution of some class of approximation problems is a linear combination of a
kernel evaluated at the training points. But only applications in one or two dimensions are given. This is
due to the fact that, in that work, the way to build r.k.h.s was based on some derivative properties. For
practical reasons only low dimension regressors were considered by this means.
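
The form of solution promised by the representer theorem can be sketched on a small regression problem. The example below is an illustration under explicit assumptions: a squared error cost, a Gaussian kernel of unit bandwidth, a regularization weight `lam`, and a hand-written linear solver, none of which are prescribed by the text. For the criterion $\sum_i (f(x_i) - y_i)^2 + \lambda \|f\|^2_{\mathcal{H}}$, the coefficients of $f = \sum_i \alpha_i K(., x_i)$ solve $(K + \lambda I)\alpha = y$.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel on the real line (illustrative choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        partial = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - partial) / m[r][r]
    return z


def fit(xs, ys, lam=1e-3):
    """Return alpha solving (K + lam * I) alpha = y."""
    gram = [[gaussian_kernel(xi, xj) for xj in xs] for xi in xs]
    for i in range(len(xs)):
        gram[i][i] += lam
    return solve_linear_system(gram, ys)


def predict(x, xs, alphas):
    """Evaluate f(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * gaussian_kernel(x, xi) for a, xi in zip(alphas, xs))


xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.sin(x) for x in xs]
alphas = fit(xs, ys)
print(predict(0.75, xs, alphas))
```

The learned function is stored only through the training points and one coefficient per point, which is what makes the kernel expansion practical whatever the dimension of the underlying function space.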

Poggio and Girosi extended the framework to large input dimensions by introducing radial functions through
a regularization operator [13]. They show how to build such a kernel as the Green function of a differential
operator defined by its Fourier transform.

Support vector machines (SVM) provide another important link between kernels, sparsity and bounds on
the generalization error [19]. This algorithm is based on Mercer's theorem and on the relationship between
kernel and dot product. It relies on the ability of a positive kernel to be separated and decomposed
according to some generating functions. But to use Mercer's theorem the kernel has to define a compact
operator. This is the case for instance when it belongs to $L^2$ functions defined on a compact domain.
Links between green functions, SVM and reproducing kernel Hilbert space were introduced in [ 8 ] and ifTTl . 
The link between r.k.h.s and bounds on a compact learning domain has been presented in a mathematical
way by Cucker and Smale [3].

Another important application of r.k.h.s to learning machines comes from the Bayesian learning community.
This is due to the fact that, in a probabilistic framework, a positive kernel is seen as a covariance
function associated with a Gaussian process.

3 Three principles on the nature of the hypothesis set 

3.1 The learning problem 

A supervised learning problem is defined by a learning domain $X \subset \mathbb{R}^d$, where $d$ denotes the number of
explanatory variables, the learning codomain $Y \subset \mathbb{R}$ and a sample of size $n$,
$\{(x_i, y_i), i = 1, \dots, n\}$: the training set.

The mainstream formulation of the learning problem considers the training of a learning machine based on
empirical data as the minimization of a given criterion with respect to some hypothesis lying in a hypothesis
set $\mathcal{H}$. In this framework hypotheses are functions $f$ from $X$ to $Y$ and the hypothesis space $\mathcal{H}$ is a
functional space.

Hypothesis H1: $\mathcal{H}$ is a functional vector space

Technically a convergence criterion is needed in $\mathcal{H}$, i.e. $\mathcal{H}$ has to be endowed with a topology. In the
remainder, we will always assume $\mathcal{H}$ to be a locally convex topological vector space.

Learning is also the minimization of some criterion. Very often the criterion to be minimized contains two
terms. The first one, $C$, represents the fidelity of the hypothesis with respect to the data while $\Omega$, the second
one, represents the compression required to make a difference between memorizing and learning. Thus the
learning machine solves the following minimization problem:

$\min_{f \in \mathcal{H}} \ C(f(x_1), \dots, f(x_n), y) + \Omega(f)$    (2)

The fact is, while writing this cost function, we implicitly assume that the value of function $f$ at any point
$x_i$ is known. We will now discuss the important consequences this assumption has on the nature of the
hypothesis space $\mathcal{H}$.

3.2 The evaluation functional 

By writing $f(x_i)$ we are assuming that function $f$ can be evaluated at this point. Furthermore if we want
to be able to use our learning machine to make a prediction for a given input $x$, $f(x)$ has to exist for all
$x \in X$: we want pointwise defined functions. This property is far from being shared by all functions. For
instance the function $\sin(1/t)$ is not defined at 0. The Hilbert space $L^2$ of square integrable functions is a
quotient space of functions defined only almost everywhere (i.e. not on the singletons $\{x\}, x \in X$). $L^2$
functions are not pointwise defined because the $L^2$ elements are equivalence classes.

To formalize our point of view we need to define $\mathbb{R}^X$ as the set of all pointwise defined functions from
$X$ to $\mathbb{R}$. For instance when $X = \mathbb{R}$ all polynomials of finite degree (including constant functions) belong
to $\mathbb{R}^X$. We can lay down our second principle:

Hypothesis H2: $\mathcal{H}$ is a set of pointwise defined functions (i.e. a subset of $\mathbb{R}^X$)

Of course this is not enough to define a hypothesis set properly and at least one other fundamental
property is required.

3.3 Continuity of the evaluation functional 

The pointwise evaluation of the hypothesis function is not enough. We also want the pointwise convergence
of the hypothesis. If two functions are close in some sense we do not want them to disagree at any point.
Assume $t$ is our unknown target function to be learned. For a given sample of size $n$ a learning algorithm
provides a hypothesis $f_n$. Assume this hypothesis converges in some sense to the target hypothesis.
The reason for building hypothesis $f_n$ is that it will be used to predict the value of $t$ at a given $x$. For any
$x$ we want $f_n(x)$ to converge to $t(x)$ as follows:

$f_n \xrightarrow{\mathcal{H}} t \implies \forall x \in X, \ f_n(x) \xrightarrow{\mathbb{R}} t(x)$

We are not interested in global convergence properties but in local convergence properties. Note that it
may be rather dangerous to define a learning machine without this property. Usually the topology on $\mathcal{H}$ is
defined by a norm. Then the pointwise convergence can be restated as follows:

$\forall x \in X, \ \exists M_x \in \mathbb{R}^+$ such that $|f(x) - t(x)| \leq M_x \|f - t\|_{\mathcal{H}}$    (3)

At any point $x$, the error can be controlled.

It is interesting to restate this hypothesis with the evaluation functional.

Definition 3.1 the evaluation functional

$\delta_x : \mathcal{H} \to \mathbb{R}$
$f \mapsto \delta_x f = f(x)$

Applied to the evaluation functional, our prerequisite of pointwise convergence is equivalent to its
continuity.

Hypothesis H3: the evaluation functional is continuous on $\mathcal{H}$

Since the evaluation functional is linear and continuous, it belongs to the topological dual of $\mathcal{H}$. We will
see that this is the key point to get the reproducing property.

Note that the continuity of the evaluation functional does not necessarily imply uniform convergence. But
in many practical cases it does. For this, one additional hypothesis is needed: the constants $M_x$ have to
be bounded, $\sup_{x \in X} M_x < \infty$. For instance this is the case when the learning domain $X$ is bounded.
The difference between uniform convergence and evaluation functional continuity is a deep and important
topic for learning machines but is out of the scope of this paper.

3.4 Important consequence 

To build a learning machine we do need to choose our hypothesis set as a reproducing space to get the
pointwise evaluation property and the continuity of this evaluation functional. But the Hilbertian structure
is not necessary. Endowing a set of functions with the property of continuity of the evaluation functional
has many interesting consequences. The most useful one in the field of learning machines is the existence
of a kernel $K$, a two-variable function with the generation property (see footnote 2):

$\forall f \in \mathcal{H}, \ \exists \ell \in \mathbb{N}, \ (\alpha_i)_{i=1,\ell}$ such that $f(x) \approx \sum_{i=1}^{\ell} \alpha_i K(x, x_i)$

the set of indices $\{1, \dots, \ell\}$ being finite. Note that for practical reasons $f$ may have a different representation.

If the evaluation space is also a Hilbert space (a vector space endowed with a dot product) it is a
reproducing kernel Hilbert space (r.k.h.s). Although not necessary, r.k.h.s are widely used for learning because
they have a lot of nice practical properties. Before moving on to more general reproducing sets, let's review
the most important properties of r.k.h.s for learning.

3.5 $\mathbb{R}^X$, the set of the pointwise defined functions on $X$

In the following, the function space of the pointwise defined functions $\mathbb{R}^X = \{f : X \to \mathbb{R}\}$ will be seen
as a topological vector space endowed with the topology of simple convergence.

$\mathbb{R}^X$ will be put in duality with $\mathbb{R}^{[X]}$, the set of all functions on $X$ equal to zero everywhere except on a
finite subset $\{x_i, i \in I\}$ of $X$. Thus all functions belonging to $\mathbb{R}^{[X]}$ can be written in the following way:

$g \in \mathbb{R}^{[X]} \iff \exists \{\alpha_i\}, i = 1, \dots, n$ such that $g(x) = \sum_i \alpha_i \mathbb{1}_{x_i}(x)$

where the indicator function $\mathbb{1}_{x_i}(x)$ is null everywhere except at $x_i$ where it is equal to one:

$\forall x \in X, \ \mathbb{1}_{x_i}(x) = 0$ if $x \neq x_i$ and $\mathbb{1}_{x_i}(x) = 1$ if $x = x_i$

Note that the indicator function is closely related to the evaluation functional since they are in bijection
through:

$\forall f \in \mathbb{R}^X, \ \forall x \in X, \ \delta_x(f) = \sum_{y \in X} \mathbb{1}_x(y) f(y) = f(x)$

But formally, $(\mathbb{R}^X)' = \mathrm{span}\{\delta_x\}$ is a set of linear forms while $\mathbb{R}^{[X]}$ is a set of pointwise defined functions.
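
The correspondence between indicator functions and evaluation functionals can be sketched in a few lines. This is only an illustration: representing a finitely supported function as a dictionary and the names `pairing` and `indicator` are assumptions made for the example.

```python
def pairing(f, g):
    """Sum of f(x) * g(x) over X, where g is finitely supported.

    g is stored as a dict {x_i: alpha_i}; it is zero outside its keys,
    so the sum over X reduces to a finite sum.
    """
    return sum(coeff * f(x) for x, coeff in g.items())


def indicator(x):
    """Indicator function of the single point x, as an element of R^[X]."""
    return {x: 1.0}


def f(x):
    """An arbitrary pointwise defined function on the real line."""
    return x ** 2 + 1.0


# Pairing with an indicator evaluates f at that point: delta_x(f) = f(x).
print(pairing(f, indicator(3.0)), f(3.0))
# A general element of R^[X] gives a finite combination of evaluations.
print(pairing(f, {0.0: 2.0, 1.0: -1.0}))
```

The dictionary plays the role of a measurement device: it carries no values of its own beyond a finite list of points and weights, while $f$ must be defined at every point where it may be probed.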

4 Reproducing Kernel Hilbert Space (r.k.h.s) 

Definition 4.1 (Hilbert space) A vector space $\mathcal{H}$ endowed with the positive definite dot product
$\langle ., . \rangle_{\mathcal{H}}$ is a Hilbert space if it is complete for the induced norm $\|f\|^2_{\mathcal{H}} = \langle f, f \rangle_{\mathcal{H}}$ (i.e. all Cauchy
sequences converge in $\mathcal{H}$).

For instance $\mathbb{R}^n$, $\mathcal{P}_k$ the set of polynomials of order lower than or equal to $k$, $L^2$, and $\ell^2$ the set of square
summable sequences seen as functions on $\mathbb{N}$ are Hilbert spaces. $L^1$ and the set of bounded functions $L^\infty$ are not.

Definition 4.2 (reproducing kernel Hilbert space (r.k.h.s)) A Hilbert space $(\mathcal{H}, \langle ., . \rangle_{\mathcal{H}})$ is a r.k.h.s if
it is a subset of $\mathbb{R}^X$ (pointwise defined functions) and if the evaluation functional is continuous on $\mathcal{H}$ (see
the definition of continuity, equation (3)).

For instance $\mathbb{R}^n$ and $\mathcal{P}_k$, as any finite dimensional set of genuine functions, are r.k.h.s. $\ell^2$ is also a r.k.h.s.
The Cameron-Martin space defined in example 8.1.2 is a r.k.h.s while $L^2$ is not because it is not a set of
pointwise defined functions.
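
The continuity of the evaluation functional in a r.k.h.s can be observed numerically. The sketch below assumes a Gaussian kernel and a function built as a finite kernel expansion (both are illustrative choices, as are the centers and coefficients). For such a function the squared norm is $\sum_i \sum_j \alpha_i \alpha_j K(c_i, c_j)$ and, by the Cauchy-Schwarz inequality, the bound of equation (3) holds with $M_x = \sqrt{K(x, x)}$.

```python
import math


def gaussian_kernel(x, y):
    """Gaussian kernel of unit bandwidth on the real line (illustrative choice)."""
    return math.exp(-0.5 * (x - y) ** 2)


centers = [-1.0, 0.0, 2.0]   # hypothetical expansion points
alphas = [0.7, -1.3, 0.4]    # hypothetical coefficients


def f(x):
    """f = sum_i alpha_i K(., c_i), an element of the r.k.h.s of the kernel."""
    return sum(a * gaussian_kernel(x, c) for a, c in zip(alphas, centers))


# Squared r.k.h.s norm of a finite expansion: alpha^T K alpha.
norm_sq = sum(
    ai * aj * gaussian_kernel(ci, cj)
    for ai, ci in zip(alphas, centers)
    for aj, cj in zip(alphas, centers)
)
norm_f = math.sqrt(norm_sq)


def bound_holds(x):
    """Check |f(x)| <= M_x * ||f|| with M_x = sqrt(K(x, x))."""
    m_x = math.sqrt(gaussian_kernel(x, x))
    return abs(f(x)) <= m_x * norm_f + 1e-12


print(all(bound_holds(-3.0 + 0.1 * k) for k in range(61)))
```

For the Gaussian kernel $K(x, x) = 1$ everywhere, so the constants $M_x$ are uniformly bounded in this particular example.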

Definition 4.3 (positive kernel) A function $K$ from $X \times X$ to $\mathbb{R}$ is a positive kernel if it is symmetric and if
for any finite subset $\{x_i\}, i = 1, \dots, n$ of $X$ and any sequence of scalars $\{\alpha_i\}, i = 1, \dots, n$

$\sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) \geq 0$

This definition is equivalent to Aronszajn's definition of a positive kernel given in equation (1).

2 This property means that the set of all finite linear combinations of the kernel is dense in $\mathcal{H}$. See proposition 4.1 for a more precise
statement.
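
The finite-sum condition of this definition lends itself to a direct numerical check. The sketch below uses the kernel $\min(x, y)$ on $[0, 1]$ with randomly drawn points and coefficients; the sample sizes and the random seed are arbitrary assumptions. Random trials cannot prove positivity, they can only fail to find a counterexample.

```python
import random


def min_kernel(x, y):
    """Kernel min(x, y) on non-negative reals."""
    return min(x, y)


def quadratic_form(kernel, points, alphas):
    """Return sum_i sum_j alpha_i * alpha_j * K(x_i, x_j)."""
    return sum(
        ai * aj * kernel(xi, xj)
        for ai, xi in zip(alphas, points)
        for aj, xj in zip(alphas, points)
    )


def is_symmetric(kernel, points):
    """Check K(x, y) == K(y, x) on every pair of sample points."""
    return all(kernel(x, y) == kernel(y, x) for x in points for y in points)


rng = random.Random(0)
points = [rng.uniform(0.0, 1.0) for _ in range(8)]

# Smallest value of the quadratic form over 200 random coefficient vectors.
smallest = min(
    quadratic_form(min_kernel, points, [rng.gauss(0.0, 1.0) for _ in points])
    for _ in range(200)
)
print(is_symmetric(min_kernel, points), smallest)
```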

Proposition 4.1 (bijection between r.k.h.s and kernel) Corollary of proposition 23 in [16] and theorem
1.1.1 in [20]. There is a bijection between the set of all possible r.k.h.s and the set of all positive kernels.

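Proof.

$\Rightarrow$ from r.k.h.s to kernel. Let $(\mathcal{H}, \langle ., . \rangle_{\mathcal{H}})$ be a r.k.h.s. By hypothesis the evaluation functional $\delta_x$ is a
continuous linear form so that it belongs to the topological dual of $\mathcal{H}$. Thanks to the Riesz theorem we know
that for each $x \in X$ there exists a function $K_x(.)$ belonging to $\mathcal{H}$ such that for any function $f(.) \in \mathcal{H}$:

$\delta_x(f(.)) = \langle K_x(.), f(.) \rangle_{\mathcal{H}}$

The family $K_x(.)$, $x \in X$, can be written as a two-variable function $K(x, y)$ from $X \times X$ to $\mathbb{R}$. This function
is symmetric and positive since, for any real finite sequence $\{\alpha_i\}, i = 1, \dots, \ell$, with
$\sum_{i=1}^{\ell} \alpha_i K(., x_i) \in \mathcal{H}$, we have:

$\left\| \sum_{i=1}^{\ell} \alpha_i K(., x_i) \right\|^2_{\mathcal{H}} = \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \alpha_i \alpha_j K(x_i, x_j) \geq 0$

$\Leftarrow$ from kernel to r.k.h.s. For any couple $(f(.), g(.))$ of $\mathbb{R}^{[X]}$ (there exist two finite sequences
$\{\alpha_i\}, i = 1, \dots, \ell$ and $\{\beta_j\}, j = 1, \dots, m$ and two sequences of points of $X$, $\{x_i\}, i = 1, \dots, \ell$ and
$\{y_j\}, j = 1, \dots, m$, such that $f(x) = \sum_{i=1}^{\ell} \alpha_i \mathbb{1}_{x_i}(x)$ and $g(x) = \sum_{j=1}^{m} \beta_j \mathbb{1}_{y_j}(x)$) we define the
following bilinear form:

$\langle f(.), g(.) \rangle_{[X]} = \sum_{i=1}^{\ell} \sum_{j=1}^{m} \alpha_i \beta_j K(x_i, y_j)$

Let $\mathcal{H}_0 = \{f \in \mathbb{R}^{[X]} ; \ \langle f(.), f(.) \rangle_{[X]} = 0\}$. $\langle ., . \rangle_{[X]}$ defines a dot product on the quotient set
$\mathbb{R}^{[X]} / \mathcal{H}_0$. Now let's define $\mathcal{H}$ as the completion of $\mathbb{R}^{[X]} / \mathcal{H}_0$ for the corresponding norm. $\mathcal{H}$ is a
r.k.h.s with kernel $K$ by construction.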

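Proposition 4.2 (from basis to kernel) Let $\mathcal{H}$ be a r.k.h.s. Its kernel $K$ can be written:

$K(x, y) = \sum_{i \in I} e_i(x) e_i(y)$

for all orthonormal bases $\{e_i\}_{i \in I}$ of $\mathcal{H}$, $I$ being a set of indices possibly infinite and non-countable.

Proof. $K(., y) \in \mathcal{H}$ implies there exists a real sequence $\{\alpha_i\}_{i \in I}$ such that $K(., y) = \sum_{i \in I} \alpha_i e_i(.)$.
Then for every element $e_i(.)$ of the orthonormal basis:

$\langle K(., y), e_i(.) \rangle_{\mathcal{H}} = e_i(y)$ because of the reproducing property of $K$

and $\langle K(., y), e_i(.) \rangle_{\mathcal{H}} = \langle \sum_{j \in I} \alpha_j e_j(.), e_i(.) \rangle_{\mathcal{H}}$

$= \sum_{j \in I} \alpha_j \langle e_j(.), e_i(.) \rangle_{\mathcal{H}}$

$= \alpha_i$ because $\{e_i\}_{i \in I}$ is an orthonormal basis

by identification we have $\alpha_i = e_i(y)$.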
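
A finite dimensional case makes this expansion concrete. The sketch below assumes the space of affine functions on $\mathbb{R}^d$ with the dot product for which $\{1, x_1, \dots, x_d\}$ is an orthonormal basis (an illustrative choice); summing the products of basis functions then yields the polynomial kernel $x^\top y + 1$ listed in table 1.

```python
def basis(x):
    """Values (e_0(x), ..., e_d(x)) with e_0 = 1 and e_i(x) = x_i (assumed orthonormal)."""
    return [1.0] + list(x)


def kernel_from_basis(x, y):
    """K(x, y) = sum_i e_i(x) * e_i(y)."""
    return sum(ex * ey for ex, ey in zip(basis(x), basis(y)))


x, y = (1.0, 2.0), (3.0, -1.0)
# 1 + (1 * 3 + 2 * (-1)) = 2
print(kernel_from_basis(x, y))
```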

Remark 4.1 Thanks to this result it is also possible to associate to any positive kernel a basis, possibly
uncountable. As a consequence of proposition 4.1 we know how to associate a r.k.h.s to any positive kernel,
and we get the result because every Hilbert space admits an orthonormal basis.

The fact that the basis is countable or uncountable (that the corresponding r.k.h.s is separable or not) has
no consequences on the nature of the hypothesis set (see example 8.1.7). Thus Mercer kernels are a particular
case of a more general situation since every Mercer kernel is positive in the Aronszajn sense (definition
4.3) while the converse is false. Consequently, when possible, a functional formulation is preferable to a
kernel formulation of a learning algorithm.

5 Kernel and kernel operator

5.1 How to build r.k.h.s?

It is possible to build r.k.h.s from a $L^2(G, \mu)$ Hilbert space where $G$ is a set (usually $G = X$) and $\mu$ a
measure. To do so, an operator $S$ is defined to map $L^2$ functions onto the set of the pointwise valued
functions $\mathbb{R}^X$. A general way to define such an operator consists in remarking that the scalar product
performs such a linear mapping. Based on that remark this operator is built from a family $\Gamma_x$ of $L^2(G, \mu)$
functions, $x \in X$, in the following way:

Definition 5.1 (Carleman operator) Let $\Gamma = \{\Gamma_x, x \in X\}$ be a family of $L^2(G, \mu)$ functions. The
associated Carleman operator $S$ is

$S : L^2 \to \mathbb{R}^X$
$f \mapsto g(.) = (Sf)(.) = \langle \Gamma_{(.)}, f \rangle_{L^2} = \int_G \Gamma_{(.)} f \, d\mu$

That is to say $\forall x \in X, \ g(x) = \langle \Gamma_x, f \rangle_{L^2}$. To make apparent the bijective restriction of $S$ it is convenient
to factorize it as follows:

$S : L^2 \to L^2 / \mathrm{Ker}(S) \xrightarrow{T} \mathrm{Im}(S) \xrightarrow{i} \mathbb{R}^X$    (4)

where $L^2 / \mathrm{Ker}(S)$ is the quotient set, $T$ the bijective restriction of $S$ and $i$ the canonical injection.

This class of integral operators is known as Carleman operators [18]. Note that this operator, unlike
Hilbert-Schmidt operators, need be neither compact nor bounded. But when $G$ is a compact set or when
$\Gamma_x \in L^2(G \times G)$ (it is a square integrable function with respect to both of its variables) $S$ is a Hilbert-Schmidt
operator. As an illustration of this property, see the Gaussian example on $G = X = \mathbb{R}$ in table 1. In that

caser K (r) L 2 (X x X% 

Proposition 5.1 (bijection between Carleman operators and the set of r.k.h.s) Proposition 21 in
[16] or theorems 1 and 4 in [14]. Let $S$ be a Carleman operator. Its image set $\mathcal{H} = \mathrm{Im}(S)$ is a r.k.h.s. If
$\mathcal{H}$ is a r.k.h.s there exists a measure $\mu$ on some set $G$ and a Carleman operator $S$ on $L^2(G, \mu)$ such that
$\mathcal{H} = \mathrm{Im}(S)$.

Proof.

$\Rightarrow$ Consider $T$ the bijective restriction of $S$ defined in equation (4). $\mathcal{H} = \mathrm{Im}(S)$ can be endowed with the
induced dot product defined as follows:

$\forall g_1(.), g_2(.) \in \mathcal{H}^2, \ \langle g_1(.), g_2(.) \rangle_{\mathcal{H}} = \langle T^{-1} g_1, T^{-1} g_2 \rangle_{L^2}$

$= \langle f_1, f_2 \rangle_{L^2}$ where $g_1(.) = T f_1$ and $g_2(.) = T f_2$

With respect to the induced norm, $T$ is an isometry. To prove $\mathcal{H}$ is a r.k.h.s, we have to check the continuity of
the evaluation functional. This works as follows:

$g(x) = (Tf)(x)$

$= \langle \Gamma_x, f \rangle_{L^2} \leq \|\Gamma_x\|_{L^2} \|f\|_{L^2}$

$\leq M_x \|g(.)\|_{\mathcal{H}}$

with $M_x = \|\Gamma_x\|_{L^2}$. In this framework the reproducing kernel $K$ of $\mathcal{H}$ verifies $S \Gamma_x = K(x, .)$. It can be built
based on $\Gamma$:

$K(x, y) = \langle K(x, .), K(y, .) \rangle_{\mathcal{H}}$

$= \langle \Gamma_x, \Gamma_y \rangle_{L^2}$

$\Leftarrow$ Let $\{e_i\}, i \in I$ be a $L^2(G, \mu)$ orthonormal basis and $\{h_j(.)\}, j \in J$ an orthonormal basis of $\mathcal{H}$. We admit
there exists a couple $(G, \mu)$ such that $\mathrm{card}(I) \geq \mathrm{card}(J)$ (take for instance the counting measure on the suitable
set). Define $\Gamma_x = \sum_{j \in J} h_j(x) e_j$ as a $L^2$ family. Let $T$ be the associated Carleman operator. The image of this
Carleman operator is the r.k.h.s spanned by the $h_j(.)$ since:

$\forall f \in L^2, \ (Tf)(x) = \langle \Gamma_x, f \rangle_{L^2}$

$= \langle \sum_{j \in J} h_j(x) e_j, \sum_{i \in I} \alpha_i e_i \rangle_{L^2}$ because $f = \sum_{i \in I} \alpha_i e_i$

$= \sum_{j \in J} \sum_{i \in I} h_j(x) \alpha_i \langle e_j, e_i \rangle_{L^2} = \sum_{j \in J} \alpha_j h_j(x)$

and the family $\{h_j\}$ is orthonormal since $h_j = T e_j$.

3 To clarify the not so obvious notion of pointwise defined function, whenever possible, we use the notation $f$ when the function is
not a pointwise defined function, while $f(.)$ denotes $\mathbb{R}^X$ functions. Here $\Gamma_x(\tau)$ is a pointwise defined function with respect to variable
$x$ but not with respect to variable $\tau$. Thus, whenever possible, the confusing notation $(\tau)$ is omitted.

Name              $\Gamma_x(u)$                                K(x, y)
Cameron Martin    $\mathbb{1}_{\{x < u\}}$                             $\min(x, y)$
Polynomial        $e_0(u) + \sum_{i=1}^{d} x_i e_i(u)$          $x^\top y + 1$
Gaussian          $\frac{1}{Z} \exp(-\frac{(x - u)^2}{2})$      $\frac{1}{Z'} \exp(-\frac{(x - y)^2}{4})$

Table 1: Examples of Carleman operators and their associated reproducing kernels. Note that functions
$\{e_i\}_{i=1,d}$ are a finite subfamily of a $L^2$ orthonormal basis. $Z$ and $Z'$ are two constants.

To put this framework to work the relevant function $\Gamma_x$ has to be found. Some examples with popular
kernels illustrating this definition are shown in table 1.
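
The Gaussian row of table 1 can be checked numerically through the relation $K(x, y) = \langle \Gamma_x, \Gamma_y \rangle_{L^2}$. The sketch below is an illustration under stated assumptions: the generating function is taken unnormalized ($Z = 1$), the integral over $\mathbb{R}$ is truncated to $[-12, 12]$, and a plain trapezoidal rule is used. Completing the square in the exponent gives the closed form $\sqrt{\pi} \exp(-(x - y)^2 / 4)$, that is $1/Z' = \sqrt{\pi}$ under this normalization.

```python
import math


def gamma(x, u):
    """Unnormalized Gaussian generating function Gamma_x(u) (Z = 1 is an assumption)."""
    return math.exp(-0.5 * (x - u) ** 2)


def kernel_by_integration(x, y, lower=-12.0, upper=12.0, steps=2400):
    """Approximate the L2 dot product of Gamma_x and Gamma_y by the trapezoidal rule."""
    h = (upper - lower) / steps
    total = 0.5 * (gamma(x, lower) * gamma(y, lower) + gamma(x, upper) * gamma(y, upper))
    for k in range(1, steps):
        u = lower + k * h
        total += gamma(x, u) * gamma(y, u)
    return total * h


def kernel_closed_form(x, y):
    """sqrt(pi) * exp(-(x - y)^2 / 4), obtained by completing the square."""
    return math.sqrt(math.pi) * math.exp(-0.25 * (x - y) ** 2)


print(kernel_by_integration(0.3, -1.2), kernel_closed_form(0.3, -1.2))
```

The width of the resulting kernel is larger than that of the generating function: the exponent is divided by 4 instead of 2, as in the table.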

5.2 Carleman operator and the regularization operator 

The same kind of operator has been introduced by Poggio and Girosi in the regularization framework [13].
They proposed to define the regularization term $\Omega(f)$ (defined in equation (2)) by introducing a regularization
operator $P$ from hypothesis set $\mathcal{H}$ to $L^2$ such that $\Omega(f) = \|Pf\|^2_{L^2}$. This framework is very attractive since
operator $P$ models the prior knowledge about the solution, defining its regularity in terms of derivative or
Fourier decomposition properties. Furthermore the authors show that, in their framework, the solution of
the learning problem is a linear combination of a kernel (a representer theorem). They also give a
methodology to build this kernel as the Green function of a differential operator. Following [2] in its introduction,
the link between Green function and r.k.h.s is straightforward when the Green function is a positive kernel.
But a problem arises when operator $P$ is chosen as a derivative operator and the resulting kernel is not
differentiable (for instance when $P$ is the simple derivation, the associated kernel is the non-differentiable
function $\min(x, y)$). A way to overcome this technical difficulty is to consider things the other way round by
defining the regularization term as the norm of the function in the r.k.h.s built from the Carleman operator $T$.
In this case we have $\Omega(f) = \|f\|^2_{\mathcal{H}} = \|T^{-1} f\|^2_{L^2}$. Thus, since $T$ is bijective, we can define operator $P$ as
$P = T^{-1}$. This is no longer a derivative operator but a generalized derivative operator where the derivation
is defined as the inverse of the integration ($P$ is defined as $T^{-1}$).

5.3 Generalization 

It is important to notice that the above framework can be generalized to non-$L^2$ Hilbert spaces. A way to
see this is to use Kolmogorov's dilation theorem [7]. Furthermore, the notion of reproducing kernel itself
can be generalized to non-pointwise defined functions by emphasizing the role played by continuity through
positive generalized kernels called Schwartz or hilbertian kernels [16]. But this is out of the scope of our
work.

6 Reproducing kernel spaces (RKS)

By focusing on the relevant hypotheses for learning we are going to generalize the above framework to
non-hilbertian spaces.



6.1 Evaluation spaces

Definition 6.1 (ES)
Let $\mathcal{H}$ be a real topological vector space (t.v.s.) on an arbitrary set $X$, $\mathcal{H} \subset \mathbb{R}^X$. $\mathcal{H}$ is an evaluation space
if and only if:

$\forall x \in X, \quad \delta_x : \mathcal{H} \to \mathbb{R}, \ f \mapsto \delta_x(f) = f(x)$ is continuous

ES are then topological vector spaces in which $\delta_t$ (the evaluation functional at $t$) is continuous, i.e. belongs
to the topological dual $\mathcal{H}^*$ of $\mathcal{H}$.

Remark 6.1 The topological vector space $\mathbb{R}^X$ with the topology of simple convergence is by construction an
ETS (evaluation topological space).

In the case of a normed vector space, another characterization can be given:

Proposition 6.1 (normed ES or BES)
Let $(\mathcal{H}, \|.\|_{\mathcal{H}})$ be a real normed vector space on an arbitrary set $X$, $\mathcal{H} \subset \mathbb{R}^X$. $\mathcal{H}$ is an evaluation kernel
space if and only if the evaluation functional satisfies:

$\forall x \in X, \ \exists M_x \in \mathbb{R}, \ \forall f \in \mathcal{H}, \ |f(x)| \leq M_x \|f\|_{\mathcal{H}}$

If it is complete for the corresponding norm it is a Banach evaluation space (BES).

Remark 6.2 In the case of a Hilbert space, we can identify $\mathcal{H}^*$ and $\mathcal{H}$ and, thanks to the Riesz theorem,
the evaluation functional can be seen as a function belonging to $\mathcal{H}$: it is called the reproducing kernel.

This is an important point: thanks to the Hilbertian structure the evaluation functional can be seen as a
hypothesis function and therefore the solution of the learning problem can be built as a linear combination
of this reproducing kernel taken at different points. The representer theorem [9] demonstrates this property
when the learning machine minimizes a regularized quadratic error criterion. We shall now generalize these
properties to the case when no hilbertian structure is available.



6.2 Reproducing kernels 

The key point when using a Hilbert space is the dot product. When no such bilinear positive functional is
available its role can be played by a duality map. Without a dot product, the hypothesis set $\mathcal{H}$ is no longer
in self duality. We need another set $\mathcal{M}$ to put in duality with $\mathcal{H}$. This second set $\mathcal{M}$ is a set of functions
measuring how the information available at point $x_1$ helps to measure the quality of the hypothesis at point
$x_2$. These two sets have to be related through a specific bilinear form. This relation is called a duality.

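Definition 6.2 (Duality between two sets) Two sets $(\mathcal{H}, \mathcal{M})$ are in duality if there exists a bilinear form
$\mathcal{L}$ on $\mathcal{H} \times \mathcal{M}$ that separates $\mathcal{H}$ and $\mathcal{M}$ (see [10] for details on the topological aspect of this definition).

Let $\mathcal{L}$ be such a bilinear form on $\mathcal{H} \times \mathcal{M}$ that separates them. Then we can define a linear application
$\gamma_{\mathcal{H}}$ and its reciprocal $\theta_{\mathcal{H}}$ as follows:

$\gamma_{\mathcal{H}} : \mathcal{M} \to \mathcal{H}^*, \quad f \mapsto \gamma_{\mathcal{H}} f = \mathcal{L}(., f)$
$\theta_{\mathcal{H}} : \mathrm{Im}(\gamma_{\mathcal{H}}) \to \mathcal{M}, \quad g = \mathcal{L}(., f) \mapsto \theta_{\mathcal{H}} g = f$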

where $\mathcal{H}^*$ (resp. $\mathcal{M}^*$) denotes the dual set of $\mathcal{H}$ (resp. $\mathcal{M}$).
Let's take an important example of such a duality.

[Figure 1: illustration of the subduality map. Hilbertian case: $K(s, t) = \langle K(s, .), K(., t) \rangle_{\mathcal{H}}$.
General case: $K(s, t) = \mathcal{L}_{\mathcal{H}}(\varkappa^*(\delta_s), \varkappa(\delta_t))$.]



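Proposition 6.2 (duality of pointwise defined functions) Let $X$ be any set (not necessarily compact).
$\mathbb{R}^X$ and $\mathbb{R}^{[X]}$ are in duality.

Proof. Let's define the bilinear application $\mathcal{L}$ as follows:

$\mathcal{L} : \mathbb{R}^X \times \mathbb{R}^{[X]} \to \mathbb{R}$

$\left( f(.), \ g(.) = \sum_{i \in I} \alpha_i \mathbb{1}_{x_i}(.) \right) \mapsto \sum_{i \in I} \alpha_i f(x_i) = \sum_{x \in X} f(x) g(x)$

Another example is given by the two following functional spaces:

$L^1 = \{ f \mid \int_X |f| \, d\mu < \infty \}$ and $L^\infty = \{ f \mid \mathrm{ess\,sup}_{x \in X} |f| < \infty \}$

where for instance $\mu$ denotes the Lebesgue measure. These two spaces are put in duality through the
following duality map:

$\mathcal{L} : L^1 \times L^\infty \to \mathbb{R}$

$f, g \mapsto \mathcal{L}(f, g) = \int_X f g \, d\mu$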
JX 

Definition 6.3 (Evaluation subduality) Two sets $\mathcal{H}$ and $\mathcal{M}$ form an evaluation subduality iff:

- they are in duality through their duality map $\mathcal{L}_{\mathcal{H}}$,

- they both are subsets of $\mathbb{R}^X$,

- the continuity of the evaluation functional is preserved through:

Span(5 x ) = lRX ((»*)') C lH {M) and 7r * ((*•*)') C ^(W) 

The key point is the way of preserving the continuity. Here the strategy to do so is first to consider two sets 
in duality and then to build the (weak) topology such that the dual elements are (weakly) continuous. 

Proposition 6.3 (Subduality kernel) A unique weakly continuous linear application $\varkappa$ is associated to
each subduality. This linear application, called the subduality kernel, is defined as follows:

$\varkappa : (\mathbb{R}^X)' \to \mathbb{R}^X$

$\sum_{i \in I} \alpha_i \delta_{x_i} \mapsto i \circ \theta_{\mathcal{M}} \circ j^* \left( \sum_{i \in I} \alpha_i \delta_{x_i} \right)$

where $i$ and $j^*$ are the canonical injections from $\mathcal{H}$ to $\mathbb{R}^X$ and from $(\mathbb{R}^X)'$ to $\mathcal{M}'$ respectively (figure 1).

Proof. For details see [10].

[Figure 2: illustration of the building operators for reproducing kernel subduality from a duality (A, B).
Panels: Duality, Evaluation subduality.]

We can illustrate this mapping detailing all performed applications as in figure 1 : 

(R*)' se M- 5 U [x] -A M' ^ H -u n x 
S x .— » I M .— » .— » .— » 

Definition 6.4 (Reproducing kernel of an evaluation subduality) Let $(\mathcal{H}, \mathcal{M})$ be an evaluation
subduality with respect to map $\mathcal{L}_{\mathcal{H}}$ associated with subduality kernel $\varkappa$. The reproducing kernel associated with
this evaluation subduality is the function of two variables defined as follows:

$K : X \times X \to \mathbb{R}$

$(x, y) \mapsto K(x, y) = \mathcal{L}_{\mathcal{H}}(\varkappa^*(\delta_y), \varkappa(\delta_x))$

This structure is illustrated in figure 1. Note that this kernel no longer needs to be positive definite. If
the kernel is positive definite it is associated with a unique r.k.h.s. However, as shown in example 8.2.1,
it can also be associated with evaluation subdualities. A way of looking at things is to define $\varkappa$ as the
generalization of the Schwartz kernel while $K$ is the generalization of the Aronszajn kernel to non-hilbertian
structures. Based on these definitions the important generation property is preserved.

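Proposition 6.4 (generation property) $\forall f \in \mathcal{H}$, $\exists (\alpha_i)_{i \in I}$ such that $f(x) \approx \sum_{i \in I} \alpha_i K(x, x_i)$, and 
$\forall g \in \mathcal{M}$, $\exists (\alpha_j)_{j \in J}$ such that $g(x) \approx \sum_{j \in J} \alpha_j K(x_j, x)$. 

Proof. This property is due to the density of $\mathrm{Span}\{K(\cdot, x), x \in X\}$ in $\mathcal{H}$. For more details see [10] Lemma 4.3. 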
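
As a minimal illustration of the generation property, the following Python sketch takes the kernel 
$K(x, y) = \min(x, y)$ on $[0, 1]$ and approximates $f(x) = x^2$ by a combination $\sum_i \alpha_i K(x, x_i)$. 
The target function, the ten equispaced points $x_i$ and all helper names are assumptions made for the 
example. The coefficients are obtained by piecewise linear interpolation, which is only one possible choice. 

```python
def min_kernel(x, y):
    """Kernel K(x, y) = min(x, y) on [0, 1]."""
    return min(x, y)


def interpolation_coefficients(f, nodes):
    """Return alpha such that sum_i alpha_i K(x, x_i) equals f at every node.

    Assumes f(0) = 0 and 0 < x_1 < ... < x_n. Between x_{k-1} and x_k the
    combination has slope sum_{i >= k} alpha_i, so alpha_k is the difference
    between two consecutive slopes of the piecewise linear interpolant.
    """
    points = [0.0] + list(nodes)
    values = [0.0] + [f(x) for x in nodes]
    slopes = [(values[k] - values[k - 1]) / (points[k] - points[k - 1])
              for k in range(1, len(points))]
    slopes.append(0.0)  # the combination is constant to the right of x_n
    return [slopes[k] - slopes[k + 1] for k in range(len(nodes))]


def expand(alphas, nodes, x):
    """Evaluate sum_i alpha_i K(x, x_i)."""
    return sum(a * min_kernel(x, xi) for a, xi in zip(alphas, nodes))


def f(x):
    return x * x


nodes = [k / 10 for k in range(1, 11)]
alphas = interpolation_coefficients(f, nodes)
grid = [k / 1000 for k in range(1001)]
max_error = max(abs(f(x) - expand(alphas, nodes, x)) for x in grid)
print(max_error)
```

Refining the points $x_i$ makes `max_error` smaller, which is the behaviour expected from the density 
argument used in the proof. 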

Just like r.k.h.s, another important point is the possibility to build an evaluation subduality, and of course 
its kernel, starting from any duality. 

Proposition 6.5 (building evaluation subdualities) Let $(A, B)$ be a duality with respect to map $\mathcal{L}_A$. Let 
$\{\Gamma_x, x \in X\}$ be a total family in $A$ and $\{\Lambda_x, x \in X\}$ be a total family in $B$. Let $S$ (resp. $T$) be the linear 
mapping from $A$ (resp. $B$) to $\mathbb{R}^X$ associated with $\Lambda_x$ (resp. $\Gamma_x$) as follows: 

$$S: A \to \mathbb{R}^X, \quad g \mapsto Sg(x) = \mathcal{L}_A(g, \Lambda_x) \qquad\qquad T: B \to \mathbb{R}^X, \quad f \mapsto Tf(x) = \mathcal{L}_A(\Gamma_x, f)$$

Then $S$ and $T$ are injective and $(S(A), T(B))$ is an evaluation subduality with the reproducing kernel $K$ 
defined by: 

$$K(x, y) = \mathcal{L}_A(\Gamma_x, \Lambda_y)$$

Proof. See [10] Lemma 4.5 and proposition 4.6. 

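An example of such subduality is obtained by mapping the $(L^1, L^\infty)$ duality to $\mathbb{R}^X$ using injective 
operators defined by the families $\Gamma_x(\tau) = \mathbb{1}_{\{\tau < x\}}$ and $\Lambda_y(\tau) = \mathbb{1}_{\{\tau < y\}}$: 

$$T: L^1 \to \mathbb{R}^X, \qquad f \mapsto Tf(x) = \langle \Gamma_x, f \rangle_{L^\infty, L^1} = \int \mathbb{1}_{\{\tau < x\}}\, f(\tau)\, d\tau$$

and 

$$S: L^\infty \to \mathbb{R}^X, \qquad g \mapsto Sg(y) = \langle g, \Lambda_y \rangle_{L^\infty, L^1} = \int g(\tau)\, \mathbb{1}_{\{\tau < y\}}\, d\tau$$

In this case $\mathcal{H} = \mathrm{Im}(T)$, $\mathcal{M} = \mathrm{Im}(S)$ and $K(y, x) = \int \Lambda(y, \tau)\, \Gamma(x, \tau)\, d\tau = \min(x, y)$. We define the 
duality map between $\mathcal{H}$ and $\mathcal{M}$ through: 

$$\mathcal{L}_{\mathcal{H}}(g_1, g_2) = \mathcal{L}_{\mathcal{H}}(Sf_1, Tf_2) = \mathcal{L}(f_1, f_2)$$

See example 8.2.1 for details. 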
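
The kernel of this example can be checked numerically. The sketch below is only an illustration: it 
assumes the domain $[0, 1]$, indicator families equal to 1 for $\tau < x$ (the orientation of example 8.2.1), 
and a midpoint quadrature for the duality map. The function names are hypothetical. 

```python
def indicator_family(x):
    """Gamma_x(tau) = Lambda_x(tau) = 1 if tau < x else 0 (assumed orientation)."""
    return lambda tau: 1.0 if tau < x else 0.0


def duality_map(f, g, n=20000):
    """Midpoint-rule approximation of L(f, g) = integral over [0, 1] of f * g."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) * g((k + 0.5) * h) for k in range(n))


def kernel(x, y):
    """K(x, y) = L(Gamma_x, Lambda_y)."""
    return duality_map(indicator_family(x), indicator_family(y))


pairs = [(0.2, 0.7), (0.5, 0.5), (0.9, 0.3)]
errors = [abs(kernel(x, y) - min(x, y)) for x, y in pairs]
print(errors)
```

With the opposite orientation of the indicators (1 for $\tau > x$) the same computation gives 
$1 - \max(x, y)$ instead of $\min(x, y)$. 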

All useful properties of r.k.h.s - pointwise evaluation, continuity of the evaluation functional, representa- 
tion and building technique - are preserved. A missing dot product has no consequence on this functional 
aspect of the learning problem. 



7 Representer theorem 

Another issue is of paramount practical importance: determining the shape of the solution. To this end 
the representer theorem states that, when $\mathcal{H}$ is a r.k.h.s, the solution of the minimization of the 
regularized cost is a linear combination of the reproducing kernel evaluated at the training examples 
[9, 15]. When the hypothesis set $\mathcal{H}$ is a reproducing space associated with a subduality we have the same 
kind of result: the solution lies in a finite $n$-dimensional subspace of $\mathcal{H}$. But we do not yet know how to 
systematically build a convenient generating family in this subspace. 

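Theorem 7.1 (representer) Assume $(\mathcal{H}, \mathcal{M})$ is a subduality of $\mathbb{R}^X$ with kernel $K(x, y)$. Assume the 
stabilizer $\Omega$ is convex and differentiable ($\partial\Omega$ denotes its subdifferential set). 

If $\partial\Omega\big(\sum_{i=1}^n \alpha_i K(x_i, x)\big) \subset \big\{\sum_{i=1}^n \beta_i \delta_{x_i}\big\} \subset \mathcal{H}^*$ then the solution of cost minimization lies in a $n$-dimensional 
subspace of $\mathcal{H}$. 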

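Proof. Define a subset of $\mathcal{M}$, $M_1 = \{\sum_{i=1}^n \alpha_i K(x_i, \cdot)\}$. Let $H_2 \subset \mathcal{H}$ be the orthogonal of $M_1$ in the sense of the 
duality map (i.e. $\forall f \in H_2$, $\forall g \in M_1$, $\mathcal{L}(f, g) = 0$). Then for all $f \in H_2$, $f(x_i) = 0$, $i = 1, \dots, n$. Now let $H_1$ be the 
complement vector space defined such that 

$$\mathcal{H} = H_1 \oplus H_2, \qquad \forall f \in \mathcal{H},\ \exists f_1 \in H_1 \text{ and } f_2 \in H_2 \text{ such that } f = f_1 + f_2$$

The solution of the minimizing problem lies in $H_1$ since: 

- $\forall f_2 \in H_2$, $C(f_2) = \text{constant}$ 

- $\Omega(f_1 + f_2) \geq \Omega(f_1) + \langle \partial\Omega(f_1), f_2 \rangle_{\mathcal{M}, \mathcal{H}}$ (thanks to the convexity of $\Omega$) 

- and $\forall f_2 \in H_2$, $\langle \partial\Omega(f_1), f_2 \rangle_{\mathcal{M}, \mathcal{H}} = 0$ by hypothesis 

By construction $H_1$ is a $n$-dimensional subspace of $\mathcal{H}$. 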
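
To make the statement concrete in the classical Hilbertian case, the following Python sketch fits a 
function of the form $\sum_i \alpha_i K(x_i, x)$ to a few training points. It rests on assumptions that the 
theorem does not impose: a squared loss, the quadratic regularizer $\lambda \|f\|^2$ of a r.k.h.s, the kernel 
$K(x, y) = \min(x, y)$, and toy data. Under these assumptions the coefficients solve the linear system 
$(G + \lambda I)\alpha = y$ with $G_{ij} = K(x_i, x_j)$. The helper names are hypothetical. 

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(xs, ys, lam):
    """Coefficients alpha solving (G + lam I) alpha = y with G_ij = min(x_i, x_j)."""
    n = len(xs)
    gram = [[min(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
            for i in range(n)]
    return solve(gram, ys)


def predict(alphas, xs, x):
    """Evaluate f(x) = sum_i alpha_i K(x_i, x)."""
    return sum(a * min(xi, x) for a, xi in zip(alphas, xs))


xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ys = [x ** 2 for x in xs]          # toy targets
alphas = fit(xs, ys, lam=1e-3)
residual = max(abs(predict(alphas, xs, x) - y) for x, y in zip(xs, ys))
print(alphas, residual)
```

The fitted function has exactly $n = 5$ coefficients, one per training point, whatever the dimension of 
the hypothesis set. 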

The nature of vector space $H_1$ depends on kernel $K$ and on regularizer $\Omega$. In some cases it is possible to 
be more precise and retrieve the nature of $H_1$. Let us assume regularizer $\Omega(f)$ is given. $\mathcal{H}$ may be chosen 
as the set of functions such that $\Omega(f) < \infty$. Then, if it is possible to build a subduality $(\mathcal{H}, \mathcal{M})$ with kernel 
$K$ such that 

$$\mathcal{H} = \underbrace{\mathrm{Vect}\{K(x_i, \cdot)\}}_{H_1} \oplus \Big( \underbrace{\mathrm{Vect}\{K(\cdot, x_i)\}}_{M_1} \Big)^{\perp}$$

and if the vector space spanned by the kernel belongs to the regularizer subdifferential $\partial\Omega(f)$: 

$$\forall f \in H_1,\ \exists g \in M_1 \text{ such that } g \in \partial\Omega(f)$$

then solution $f^*$ of the minimization of the regularized empirical cost is a linear combination of the kernel: 

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$$

An example of such result is given with the following regularizer based on the $p$-norm on $G = [0, 1]$: 

$$\Omega(f) = \int_0^1 (f')^p \, d\mu$$

The hypothesis set is Sobolev space $H^p$ (the set of functions defined on $[0, 1]$ whose generalized derivative 
is $p$-integrable) put in duality with $H^q$ (with $1/p + 1/q = 1$) through the following duality map: 

$$\mathcal{L}(f, g) = \int_0^1 f' g' \, d\mu$$

The associated kernel is, just like in the Cameron-Martin case, $K(x, y) = \min(x, y)$. Some tedious derivations 
lead to: 

$$\forall h \in \mathcal{H}, \quad \mathcal{L}(h, \partial\Omega(f)) = \int_0^1 h' \, p\, (f')^{p-1} \, d\mu$$

Thus the kernel verifies p(K(., 1 K K(x 1 .) 
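
The expression of the regularizer derivative can be checked numerically. The sketch below compares a 
finite-difference directional derivative of $\Omega(f) = \int_0^1 (f')^p d\mu$ in the direction $h$ with the integral 
$\int_0^1 h' \, p \, (f')^{p-1} d\mu$. The choices $p = 3$, $f'(t) = 1 + t$, $h'(t) = \cos t$, the step `eps` and the midpoint 
quadrature are assumptions made for the illustration only. 

```python
import math


def integrate(func, n=2000):
    """Midpoint-rule approximation of the integral of func over [0, 1]."""
    h = 1.0 / n
    return h * sum(func((k + 0.5) * h) for k in range(n))


def omega(df, p):
    """Omega(f) = integral over [0, 1] of (f')^p, computed from the derivative df."""
    return integrate(lambda t: df(t) ** p)


def pairing(dh, df, p):
    """Integral over [0, 1] of h' * p * (f')^(p - 1)."""
    return integrate(lambda t: dh(t) * p * df(t) ** (p - 1))


p = 3
df = lambda t: 1.0 + t        # derivative of f(t) = t + t^2 / 2
dh = lambda t: math.cos(t)    # derivative of h(t) = sin(t)
eps = 1e-4

# Central finite difference of eps -> Omega(f + eps * h) at eps = 0.
finite_difference = (omega(lambda t: df(t) + eps * dh(t), p)
                     - omega(lambda t: df(t) - eps * dh(t), p)) / (2 * eps)
analytic = pairing(dh, df, p)
print(finite_difference, analytic)
```

Both quantities agree up to the finite-difference error, which is consistent with reading $\partial\Omega(f)$ as the 
element whose derivative is $p\,(f')^{p-1}$. 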

This question of the representer theorem is far from being closed. We are still looking for a way to derive 
a generating family from the kernel and the regularizer. To obtain more general and constructive 
results, one possible line of investigation is to go through the Fenchel dual of $\Omega$. 



8 Examples 

8.1 Examples in Hilbert space 

The examples in this section all deal with r.k.h.s included in an $L^2$ space. 

1. Schmidt ellipsoid: 

Let $(X, \mu)$ be a measure space, $\{e_i, i \in I\}$ a basis of $L^2(X, \mu)$, $I$ being a countable set of indices. 
Any sequence $\{\alpha_i, i \in I, \sum_{i \in I} \alpha_i^2 < +\infty\}$ defines a Hilbert-Schmidt operator on $L^2(X, \mu)$ with 
kernel function $\Gamma(x, y) = \sum_{i \in I} \alpha_i e_i(x) e_i(y)$, thus a reproducing kernel Hilbert space with kernel 
function: 

$$\forall (x, y) \in X^2, \quad K(x, y) = \sum_{i \in I} \alpha_i^2 \, e_i(x) e_i(y)$$

The closed unit ball $\mathcal{B}_H$ of the r.k.h.s verifies 

$$\mathcal{B}_H = T(\mathcal{B}_{L^2}) = \Big\{ f \in L^2,\ f = \sum_{i \in I} f_i e_i,\ \sum_{i \in I} \Big( \frac{f_i}{\alpha_i} \Big)^2 \leq 1 \Big\}$$

and is then a Schmidt ellipsoid in $L^2$. An interesting discussion about Schmidt ellipsoids and their 
applications to sample continuity of Gaussian measures may be found in [6]. 

2. Cameron-Martin space: 

Let $T$ be the Carleman integral operator on $L^2([0, 1], \mu)$ ($\mu$ is the Lebesgue measure) with kernel 
function 

$$\Gamma(x, y) = Y(x - y) = \mathbb{1}_{\{y \leq x\}}$$

it defines a r.k.h.s with reproducing kernel $K(x, y) = \min(x, y)$. The space $(H, \langle \cdot, \cdot \rangle_H)$ is the 
Sobolev space of degree 1, also called the Cameron-Martin space. 

$$H = \Big\{ f \text{ absolutely continuous},\ \exists f' \in L^2([0, 1]),\ f(x) = \int_0^x f' \, d\mu \Big\}$$

$$\langle f, g \rangle_H = \langle f', g' \rangle_{L^2}$$

3. A Carleman but non Hilbert-Schmidt operator: 

Let $T$ be the integral operator on $L^2(\mathbb{R}, \mu)$ ($\mu$ is the Lebesgue measure) with kernel function 

$$\Gamma(x, y) = \exp\big( -(x - y)^2 \big)$$

It is a Carleman integral operator, thus we can define a r.k.h.s $(H, \langle \cdot, \cdot \rangle_H) = \mathrm{Im}(T)$, but $T$ is not a 
Hilbert-Schmidt operator. The reproducing kernel of $H$ is: 

$$K(x, y) = \frac{1}{Z} \exp\Big( -\frac{(x - y)^2}{2} \Big)$$

where $Z$ is a suitable constant. 

4. Continuous kernel: 

This example is based on theorem 3.11 in [12]. Let $X$ be a compact subspace of $\mathbb{R}$, $K(\cdot, \cdot)$ a 
continuous symmetric positive definite kernel. It defines a r.k.h.s $(H, \langle \cdot, \cdot \rangle_H)$ and any Radon measure 
$\mu$ of full support is kernel-injective. Then, for any such $\mu$, there exists a Carleman operator $T$ on 
$L^2(X, \mu)$ such that $(H, \langle \cdot, \cdot \rangle_H) = \mathrm{Im}(T)$. 

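5. Hilbert space of constants: 

Let $(H, \langle \cdot, \cdot \rangle_H)$ be the Hilbert space of constant functions on $\mathbb{R}$ with scalar product $\langle f, g \rangle_H = 
f(0) g(0)$. It is obviously a r.k.h.s with reproducing kernel $K(\cdot, \cdot) = 1$. For any probability measure 
$\mu$ on $\mathbb{R}$ let: 

$$\forall f \in L^2(\mathbb{R}, \mu), \quad Tf = \int f(s)\, \mu(ds)$$

Then $H = T(L^2(\mathbb{R}, \mu))$ and $\forall f, g \in H$, $\langle f, g \rangle_H = \langle f, g \rangle_{L^2}$. 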

6. A non-separable r.k.h.s - the $L^2$ space of almost surely null functions: 

Define the positive definite kernel function on $X \subset \mathbb{R}$ by $\forall s, t \in X$, $K(s, t) = \mathbb{1}_{\{s = t\}}$. It defines 
a r.k.h.s $(H, \langle \cdot, \cdot \rangle_H)$ and its functions are null except on a countable set. Define a measure $\mu$ on 
$(X, \mathcal{B})$, where $\mathcal{B}$ is the Borel $\sigma$-algebra on $X$, by $\mu(t) = 1$ $\forall t \in X$. $\mu$ verifies: $\mu(\{t_1, \dots, t_n\}) = n$ 
and $\mu(A) = +\infty$ for any non-finite $A \in \mathcal{B}$. The kernel function is then square integrable and $H$ is 
injectively included in $L^2(X, \mathcal{B}, \mu)$. Moreover, $K(s, t) = \int_X K(t, u) K(u, s)\, d\mu(u)$ with $K$ Carleman 
integrable and $T = \mathrm{Id}_{L^2}$ (note that the identity is a non-compact Carleman integral operator). 
Finally, $(H, \langle \cdot, \cdot \rangle_H) = L^2(X, \mathcal{B}, \mu)$. 

7. Separable r.k.h.s: 

Let $H$ be a separable r.k.h.s. It is well known that any separable Hilbert space is isomorphic to 
$\ell^2$. Then there exists a kernel operator $T$ such that $\mathrm{Im}(T) = H$. It is easy to construct effectively such a $T$: 
let $\{h_n(\cdot), n \in \mathbb{N}\}$ be an orthonormal basis of $H$ and define $T$ as the kernel operator on $\ell^2$ with kernel 

$$\Gamma: x \mapsto \{h_n(x), n \in \mathbb{N}\} \ (\in \ell^2)$$

Then $\mathrm{Im}(T) = H$. 
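
Several of the examples above obtain the reproducing kernel by composing the operator kernel with 
itself, $K(x, y) = \int \Gamma(x, u)\, \Gamma(y, u)\, d\mu(u)$. The Python sketch below checks this numerically for examples 1 and 3. 
It is an illustration under stated assumptions: for example 1, a cosine basis of $L^2([0, 1])$ truncated to six 
terms with weights $\alpha_i = 1/(i + 1)$; for example 3, the operator kernel $\Gamma(x, y) = \exp(-(x - y)^2)$, for which 
the composition gives $\sqrt{\pi / 2}\, \exp(-(x - y)^2 / 2)$. The function names and quadrature settings are hypothetical. 

```python
import math


def compose(gamma, x, y, lo, hi, n):
    """Midpoint-rule approximation of the integral over [lo, hi] of gamma(x, u) * gamma(y, u)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        u = lo + (k + 0.5) * h
        total += gamma(x, u) * gamma(y, u)
    return h * total


# Example 1 (Schmidt ellipsoid), truncated to M terms of a cosine basis of L^2([0, 1]).
M = 6
weights = [1.0 / (i + 1) for i in range(M)]


def e(i, x):
    """Orthonormal cosine basis function number i on [0, 1]."""
    return 1.0 if i == 0 else math.sqrt(2.0) * math.cos(i * math.pi * x)


def gamma_schmidt(x, y):
    return sum(a * e(i, x) * e(i, y) for i, a in enumerate(weights))


def kernel_schmidt(x, y):
    return sum(a * a * e(i, x) * e(i, y) for i, a in enumerate(weights))


# Example 3 (Gaussian operator kernel on the real line, assumed form).
def gamma_gauss(x, y):
    return math.exp(-(x - y) ** 2)


def kernel_gauss(x, y):
    return math.sqrt(math.pi / 2.0) * math.exp(-0.5 * (x - y) ** 2)


x, y = 0.3, 0.8
gap_schmidt = abs(compose(gamma_schmidt, x, y, 0.0, 1.0, 400) - kernel_schmidt(x, y))
# The real line is truncated to [-12, 12], where the Gaussian tails are negligible.
gap_gauss = abs(compose(gamma_gauss, x, y, -12.0, 12.0, 4800) - kernel_gauss(x, y))
print(gap_schmidt, gap_gauss)
```

In the first case the weights of $K$ are the squares of the weights of $\Gamma$; in the second the composition 
widens the Gaussian, which is why the operator kernel and the reproducing kernel differ. 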

8.2 Other examples 

Applications to non-hilbertian spaces are also feasible: 

1. $(L^1, L^\infty)$ - "Cameron-Martin" evaluation subduality: 

Let $T$ be the kernel operator on $L^1([0, 1], \mu)$ ($\mu$ is the Lebesgue measure) with kernel function 

$$\Gamma(t, s) = Y(t - s) = \mathbb{1}_{\{s \leq t\}}, \qquad \Gamma(t, \cdot) \in L^\infty$$

it defines an evaluation subduality $(\mathcal{H}_1, \mathcal{H}_\infty)$ with reproducing kernel 

$$\forall (s, t) \in X^2, \quad K(s, t) = \min(s, t)$$

$$\mathcal{H}_1 = \Big\{ f \text{ absolutely continuous},\ \exists f' \in L^1([0, 1]),\ f(t) = \int_0^t f'(s)\, ds \Big\}, \qquad \|f\|_{\mathcal{H}_1} = \|f'\|_{L^1}$$

and 

$$\mathcal{H}_\infty = \Big\{ f \text{ absolutely continuous},\ \exists f' \in L^\infty([0, 1]),\ f(t) = \int_0^t f'(s)\, ds \Big\}, \qquad \|f\|_{\mathcal{H}_\infty} = \|f'\|_{L^\infty}$$

2. $(\mathbb{R}^X, \mathbb{R}^{[X]})$: 

We have seen that $\mathbb{R}^X$ endowed with the topology of simple convergence is an ETS. However, $\mathbb{R}^X$ 
endowed with the topology of almost sure convergence is never an ETS unless every singleton of $X$ 
has strictly positive measure. 
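
The first example above can be explored numerically. The sketch below assumes that the duality map 
between the two spaces is $\mathcal{L}(f, g) = \int_0^1 f'(s)\, g'(s)\, ds$ and checks that pairing $f$ with the kernel section 
$K(\cdot, t)$ returns $f(t)$. The test function $f(t) = \sqrt{t}$ is an assumption chosen because its derivative is 
integrable but not square integrable, so that $f$ lies outside the Cameron-Martin space of section 8.1. 
The names and the crude midpoint quadrature are hypothetical choices. 

```python
import math


def pairing(df, dg, n=20000):
    """Assumed duality map: integral over [0, 1] of f'(s) * g'(s), by the midpoint rule."""
    h = 1.0 / n
    return h * sum(df((k + 0.5) * h) * dg((k + 0.5) * h) for k in range(n))


def kernel_section_derivative(t):
    """Derivative in s of K(s, t) = min(s, t): a bounded indicator function."""
    return lambda s: 1.0 if s < t else 0.0


def f(t):
    return math.sqrt(t)


def df(s):
    """Derivative of sqrt: integrable on [0, 1] but unbounded near 0."""
    return 0.5 / math.sqrt(s)


points = [0.25, 0.64, 1.0]
gaps = [abs(pairing(df, kernel_section_derivative(t)) - f(t)) for t in points]
print(gaps)
```

The gaps are small but not negligible because the midpoint rule converges slowly near the singularity 
of $f'$ at 0; a finer grid reduces them. 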

9 Conclusion 

It is always possible to learn without kernel. But even if it is not visible, one is hidden somewhere! We have 
shown, from some basic principles (we want to be able to compute the value of a hypothesis at any point 
and we want the evaluation functional to be continuous), how to derive a framework generalizing r.k.h.s to 
non-Hilbertian spaces. In our reproducing kernel dualities, all the nice properties of r.k.h.s are preserved except 
the dot product, which is replaced by a duality map. Based on the generalization of the Hilbertian case, it is possible 
to build associated kernels thanks to simple operators. The construction of evaluation subdualities without 
Hilbert structure is easy within this framework (and rather new). The derivation of evaluation subdualities 
from any kernel operator has many practical outcomes. First, such operators on separable Hilbert spaces 
can be represented by matrices, and we can build any separable r.k.h.s from well-known $\ell^2$ structures (like 
wavelets in an $L^2$ space for instance). Furthermore, the set of kernel operators is a vector space whereas 
the set of evaluation subdualities is not (the set of r.k.h.s is for instance a convex cone), hence practical 
combinations of such operators are feasible. On the other hand, from the Bayesian point of view, this result 
may have many theoretical and practical implications in the theory of Gaussian or Laplacian measures and 
abstract Wiener spaces. 

Unfortunately, even if some work has been done, a general representer theorem is not available yet. We 
are looking for an automatic mechanism designing the shape of the solution of the learning problem in the 
following way: 

$$f(x) = \sum_{i=1}^{m} \alpha_i K(x_i, x) + \sum_{j=1}^{k} \beta_j \psi_j(x)$$

where kernel $K$, the number of components $m$ and the functions $\psi_j(x)$, $j = 1, \dots, k$ are derived from the regularizer $\Omega$. 
The remaining questions are: how to learn the coefficients and how to determine the cost function? 